Duplicate-based schema matching
Abstract
The integration of independently developed data sources poses many problems that result from several types of heterogeneity. One of the most daunting challenges is schema matching, the semi-automatic process of detecting semantic relationships between attributes in heterogeneous schemata. Various solutions that exploit schema information or extract specific features from attribute values have been described. In this thesis we propose novel schema matching algorithms that exploit fuzzy duplicates, i.e., different representations of the same real-world entity.
We describe the DUMAS table matcher, whose goal is to establish attribute correspondences between two tables. Finding the duplicates that can be used for schema matching is a challenging task because the semantic relationships between the tables are unknown, and thus existing duplicate detection solutions cannot be applied. We discuss the novel problem of duplicate detection in unaligned relations and describe an algorithm that is able to detect the top-k duplicates. The attribute correspondences between the two tables are extracted from those duplicates in a subsequent step.
The DUMAS schema matcher extends the duplicate-based matching approach to complex schemata consisting of multiple tables. Finding attribute correspondences between complex schemata poses several new challenges that do not occur when single tables are matched, and these challenges complicate the application of the table matcher. We describe heuristics used to determine whether a table matching can be trusted, and develop an algorithm that exploits multi-table duplicates to detect correspondences between complex schemata.
The previous two algorithms are restricted to simple (i.e., 1:1) correspondences. Because complex (i.e., 1:n or m:n) correspondences do occur in practice, we developed the DUMAS complex matcher. This matcher takes the result of the DUMAS table matcher and improves the matching by merging certain attributes, thereby detecting complex correspondences. Because the space of possible complex matchings is very large, we devised several heuristics to decrease the number of attribute combinations that have to be considered.
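The two-step idea behind the table matcher (find the top-k duplicates without knowing how the columns align, then read attribute correspondences off those duplicates) can be illustrated with a minimal Python sketch. This is not the DUMAS implementation: the abstract does not specify the similarity measures or the assignment step, so the sketch assumes whitespace tokenization with Jaccard overlap for ranking row pairs, `difflib.SequenceMatcher` for field-wise comparison, and a greedy 1:1 assignment. The function names, the `threshold` parameter, and the sample tables are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def tokens(row):
    """Treat a whole row as a bag of lowercase tokens, ignoring column names."""
    return {tok for value in row.values() for tok in str(value).lower().split()}


def top_k_duplicates(table_a, table_b, k=2):
    """Rank all row pairs by Jaccard overlap of their token sets; keep the best k.

    Assumption: Jaccard similarity stands in for whatever measure the thesis uses.
    """
    scored = []
    for (i, a), (j, b) in product(enumerate(table_a), enumerate(table_b)):
        ta, tb = tokens(a), tokens(b)
        union = ta | tb
        score = len(ta & tb) / len(union) if union else 0.0
        scored.append((score, i, j))
    scored.sort(key=lambda t: -t[0])
    return [(i, j) for _, i, j in scored[:k]]


def match_attributes(table_a, table_b, k=2, threshold=0.5):
    """Derive 1:1 attribute correspondences from the top-k duplicate pairs."""
    cols_a, cols_b = list(table_a[0]), list(table_b[0])
    dups = top_k_duplicates(table_a, table_b, k)
    # Average the field-wise string similarity over all duplicate pairs.
    sim = {}
    for ca, cb in product(cols_a, cols_b):
        total = sum(
            SequenceMatcher(
                None, str(table_a[i][ca]).lower(), str(table_b[j][cb]).lower()
            ).ratio()
            for i, j in dups
        )
        sim[(ca, cb)] = total / len(dups)
    # Greedy 1:1 assignment (an assumption): best pairs first, each attribute once.
    matching, used_a, used_b = {}, set(), set()
    for (ca, cb), score in sorted(sim.items(), key=lambda kv: -kv[1]):
        if score >= threshold and ca not in used_a and cb not in used_b:
            matching[ca] = cb
            used_a.add(ca)
            used_b.add(cb)
    return matching


# Hypothetical sample data: same entities, different column names and order.
table_a = [
    {"name": "John Smith", "city": "Berlin", "phone": "030 1234567"},
    {"name": "Mary Jones", "city": "Hamburg", "phone": "040 7654321"},
    {"name": "Peter Miller", "city": "Munich", "phone": "089 1112223"},
]
table_b = [
    {"tel": "040 7654321", "person": "Mary Jones", "town": "Hamburg"},
    {"tel": "030 1234567", "person": "J. Smith", "town": "Berlin"},
    {"tel": "0221 998877", "person": "Anna Schmidt", "town": "Cologne"},
]

print(match_attributes(table_a, table_b))
```

Comparing whole rows as token bags is what lets the first step work before any column alignment is known. The greedy assignment only yields simple 1:1 correspondences, so it says nothing about the 1:n or m:n cases that the complex matcher addresses.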
Similar resources
An Improved Semantic Schema Matching Approach
Schema matching is a critical step in many applications, such as data warehouse loading, Online Analytical Processing (OLAP), data mining, the semantic web [2] and schema integration. The task is to find the semantic correspondences between elements of two schemas. Recently, schema matching has attracted considerable interest in both research and practice. In this paper, we present a new impr...
Eliminating NULLs with Subsumption and Complementation
In a data integration process, an important step after schema matching and duplicate detection is data fusion. It is concerned with the combination or merging of different representations of one real-world object into a single, consistent representation. In order to solve potential data conflicts, many different conflict resolution strategies can be applied. In particular, some representations ...
Record Matching Over Query Results Using Fuzzy Ontological Document Clustering
Record matching is an essential step in duplicate detection, as it identifies records representing the same real-world entity. Supervised record matching methods require users to provide training data and therefore cannot be applied to web databases, where query results are generated on the fly. To overcome the problem, a new record matching method named Unsupervised Duplicate Elimination (UDE) is p...
A Semi Automatic Tool For Schema Mapping
neric mapping framework at the schema level to address the problem of schema interoperability Providing a formalism for developing a generic, extensible, and semi-automated mapping A semi-automatic tool for schema mapping. at the University of Washington in Seattle, where he founded the database group. on Clio, the first semi-automatic tool for heterogeneous schema mapping. Keywords: data integ...
Finding nontrivial semantic matches between database schemas
Automation of schema matching has been under investigation for some decades already, yet the systems usually do not find all matches or suggest incorrect matches. Due to this imperfection, schema matching is still often done manually by domain experts. The rapidly increasing number of heterogeneous and distributed data...
Publication date: 2006